Abstract: Techniques based on randomized response enable 
the collection of potentially sensitive data from clients in a 
privacy-preserving manner with strong local differential privacy 
guarantees. One of the latest such technologies, RAPPOR [9], 
allows the marginal frequencies of an arbitrary set of strings 
to be estimated via privacy-preserving crowdsourcing. However, 
this original estimation process requires a known set of possible 
strings; in practice, this dictionary can often be extremely large 
and sometimes completely unknown. 

In this paper, we propose a novel decoding algorithm for the 
RAPPOR mechanism that enables the estimation of “unknown 
unknowns,” i.e., strings we do not even know we should be 
estimating. To enable learning without explicit knowledge of the 
dictionary, we develop methodology for estimating the joint 
distribution of two or more variables collected with RAPPOR. This 
is a critical step towards understanding relationships between 
multiple variables collected in a privacy-preserving manner. 

I. Introduction 

It is becoming increasingly commonplace for companies 
and organizations to analyze user data in order to improve 
services or products. For instance, a utilities company might 
collect water usage statistics from its users to help inform fair 
pricing schemes. Although analyzing user data can be very 
beneficial, it can also negatively impact user privacy. Data 
collection can quickly form databases that reveal sensitive 
details, such as preferences, habits, or personal characteristics 
of implicitly or explicitly identifiable users. It is therefore 
important to develop methods for analyzing the data of a 
population without sacrificing individuals’ privacy. 

A guarantee of local differential privacy can provide the 
appropriate privacy protection without requiring individuals 
to trust the intentions of a data aggregator. Informally, 
a locally differentially-private mechanism asks individuals to 
report data to which they have added carefully-designed noise, 
such that any individual’s information cannot be learned, 
but an aggregator can correctly infer population statistics. 
The recently-introduced Randomized Aggregatable Privacy- 
Preserving Ordinal Response (RAPPOR) is the first such 
mechanism to see real-world deployment [9]. 

RAPPOR is motivated by the problem of estimating a 
client-side distribution of string values drawn from a discrete 
data dictionary. Such estimation is useful in many security- 
related scenarios. For example, RAPPOR is reportedly used 
in the Chrome Web browser to track the distribution of users’ 
browser configuration strings; this is done to detect anomalies 
symptomatic of abusive software. 


Unfortunately, in its current state, the RAPPOR technology 
can be of only limited applicability and utility. This is 
because RAPPOR makes two simplifying assumptions that will 
certainly not always hold in practice: 

Assumption 1: Aggregators only need to learn the 
distribution of a single variable, in isolation. In 
practice, aggregators may want to study the association 
between multiple variables because attributes are often 
more meaningful in association with other attributes. 
For example, in RAPPOR’s application domain in the 
Chrome Web browser, an innocent-looking homepage or 
search-provider URL may become highly suspect if its 
use is strongly correlated with installation of software 
that is known to be malicious. 

Assumption 2: Aggregators know the data dictionary of 
possible string values in advance. There are many 
scenarios in which both the frequencies of client-side 
strings and the strings themselves may be unknown. For 
instance, when collecting reports on installed software, 
it is unlikely that the names or hash values of all 
software will be known ahead of time, especially in the 
face of polymorphic software. Similarly, when 
studying user-generated data (manually-entered hashtags, for 
instance), the dictionary of possible strings cannot be 
known a priori. 

Lifting these two simplifying assumptions is a significant 
challenge, which requires reasoning about “unknown 
unknowns.” The first assumption can only be removed by 
estimating the unknown joint distributions of two or more 
unknown variables that are observed only via differentially- 
private RAPPOR responses. Removing the second assumption 
requires learning a data dictionary of unknown client-side 
strings whose frequency distribution is also unknown. This 
process must additionally satisfy strong privacy guarantees 
that preclude the use of encryption or special encodings 
that could link individuals to strings. Furthermore, neither 
of these challenges admits a solution that is simultaneously 
feasible and straightforward. The naive approach of trying 
all possibilities incurs exponential blowup over the infinite 
domain of unknown strings, and is not even well-defined with 
regard to estimating joint distributions. 

This paper provides methods for addressing these two 
challenges, thereby substantially improving upon the 
recently-introduced RAPPOR statistical crowdsourcing technology. 

First, regarding multivariate analysis, we present a 
collection of statistical tools for studying the association between 
multiple random variables reported through RAPPOR. This 
toolbox includes an expectation-maximization-based algorithm 
for inferring joint distributions of multiple variables from 
a collection of RAPPOR reports. It also includes tools for 
computing the variance of the distribution estimates, as well 
as testing for independence between variables of interest. 

Second, regarding unknown data dictionaries, we introduce 
a novel algorithm for estimating a distribution of strings 
without knowing the set of possible values beforehand. This 
algorithm asks each reportee to send a noisy representation 
of multiple substrings from her string. Using our previously- 
developed techniques for association analysis, we build joint 
distributions of all possible substrings. This allows the 
aggregator to learn the data dictionary for all frequent values 
underlying the reports. 

If differential privacy is to gain traction outside the research 
community, we believe it is critical to tackle the practical 
challenges that currently limit its immediate usefulness; in this 
work, we address two such challenges. We demonstrate the 
practical efficacy of both contributions through simulation and 
real-world examples. For these experiments we have created 
implementations of our analysis that we are making publicly 
available (deferred for blind review). While motivated by the 
recently-introduced RAPPOR mechanism, and presented in 
that context, our contributions are not unique to the RAPPOR 
encoding and decoding algorithms. Our methods can be easily 
extended to any locally differentially-private system that is 
attempting to learn a distribution of discrete, string-valued 
random variables. 


II. Background 

A common method for collecting population-level statistics 
without access to individual-level data points is based on 
randomized response [23]. Randomized response is an 
obfuscation technique that satisfies a privacy guarantee known as 
local differential privacy. We begin by briefly introducing 
local differential privacy and explaining how the RAPPOR 
system uses randomized response to satisfy this condition. 

Formally, a randomized algorithm $A$ (in this case, RAPPOR) 
satisfies $\epsilon$-differential privacy if for all pairs of 
client values $x_1$ and $x_2$ and for all $R \subseteq \mathrm{Range}(A)$, 

$$P(A(x_1) \in R) \le e^{\epsilon} P(A(x_2) \in R).$$ 

Intuitively, this says that no matter what string user Alice 
is storing, the aggregator’s knowledge about Alice’s ground 
truth does not change too much based on the information she 
sends. We would like to emphasize that differential privacy is 
a property of an encoding algorithm, so these guarantees hold 
regardless of the underlying data distribution. 

RAPPOR is a privacy-preserving data-collection mechanism 
that makes use of randomization to guarantee local differential 
privacy for every individual’s reports. Despite satisfying such 
a strong privacy definition, RAPPOR enables the aggregator 
to accurately estimate a distribution over a discrete dictionary 
(e.g., a set of strings). 

The basic concept of randomized response is best explained 
with an example. Suppose the Census Bureau wants to know 
how many communists live in the United States without 
learning who is a communist. The administrator asks each 
participant to answer the question, “Are you a communist?” 
in the following manner: Flip an unbiased coin. If it comes 
up heads, answer truthfully. Otherwise, answer ‘yes’ with 
probability 0.5 and ‘no’ with probability 0.5. In the end, the 
Census Bureau cannot tell which people are communists, but 
it can estimate the true fraction of communists with high 
confidence. Randomized response refers to this addition of 
carefully-designed noise to discrete random values in order to 
mask individual data points while enabling the computation 
of aggregate statistics. 
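As a minimal sketch of the coin-flip example above, the following simulation shows how the aggregate can be recovered from the noisy answers. The population size, the 10% true fraction, the fixed seed, and the function names are assumptions made purely for illustration.

```python
import random


def randomized_answer(truth: bool, rng: random.Random) -> bool:
    """One respondent's answer under the coin-flip protocol described above."""
    if rng.random() < 0.5:       # first coin is heads: answer truthfully
        return truth
    return rng.random() < 0.5    # otherwise: 'yes' or 'no' with probability 0.5 each


def estimate_true_fraction(answers: list[bool]) -> float:
    """Invert P(yes) = 0.5 * pi + 0.25 to recover the true fraction pi."""
    yes_rate = sum(answers) / len(answers)
    return 2.0 * (yes_rate - 0.25)


rng = random.Random(0)                               # fixed seed, for reproducibility only
population = [i < 10_000 for i in range(100_000)]    # assumed ground truth: 10% 'yes'
answers = [randomized_answer(t, rng) for t in population]
estimate = estimate_true_fraction(answers)           # close to 0.10, up to sampling noise
```

Under this protocol a truthful 'yes' is reported with probability 0.75 and a false 'yes' with probability 0.25, so the ratio of report probabilities is at most 3; in terms of the definition above, this corresponds to $\epsilon = \ln 3$.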

RAPPOR performs two rounds of randomized response to 
mask the inputs of users and enable the collection of user 
data over time. Suppose Alice starts with the string X (e.g., 
X = “rabbit”). The sequence of events in the encoder is as 
follows: 

1) Hash the string $X$ twice ($h$ times in general) into a 
fixed-length Bloom filter, $B$. 

2) Pass each bit $B_i$ in the Bloom filter through a randomized 
response (giving $B'_i$) as follows: 

$$B'_i = \begin{cases} 1, & \text{with probability } \frac{1}{2} f \\ 0, & \text{with probability } \frac{1}{2} f \\ B_i, & \text{with probability } 1 - f \end{cases}$$ 

where $f$ is a user-tunable parameter controlling the level 
of privacy guarantees. We refer to this noisy Bloom filter 
$B'$ as the permanent randomized response (PRR) for the 
value $X$, because this same $B'$ is to be used for both 
the current and all future responses about the value $X$. 

3) Each time the aggregator requests a report, pass each 
bit $B'_i$ in the PRR through another round of randomized 
response (giving $X'_i$), as follows: 

$$P(X'_i = 1) = \begin{cases} q, & \text{if } B'_i = 1 \\ p, & \text{if } B'_i = 0. \end{cases}$$ 

We refer to this array of bits $X'$ as an instantaneous 
randomized response (IRR), and the aggregator only 
ever sees such bit vectors $X'$ for any value $X$ that is 
reported. The smaller the difference between $q$ and $p$ 
(user-tunable), the greater the privacy guarantee. 
This process is visualized in Figure 1. 
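The three encoder steps above can be sketched as follows. This is a minimal illustration, not the reference implementation from [9]: the salted SHA-256 hashing, the filter size of 32 bits, the parameter values, and all function names are our assumptions, chosen only to keep the sketch self-contained.

```python
import hashlib
import random


def bloom_bits(value: str, k: int, h: int) -> list[int]:
    """Step 1: set up to h positions of a k-bit Bloom filter (salted SHA-256 is an assumed hash)."""
    bits = [0] * k
    for salt in range(h):
        digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % k] = 1
    return bits


def permanent_rr(bits: list[int], f: float, rng: random.Random) -> list[int]:
    """Step 2 (PRR): each bit becomes 1 w.p. f/2, 0 w.p. f/2, and is kept w.p. 1 - f."""
    out = []
    for b in bits:
        u = rng.random()
        if u < f / 2:
            out.append(1)
        elif u < f:
            out.append(0)
        else:
            out.append(b)
    return out


def instantaneous_rr(prr: list[int], p: float, q: float, rng: random.Random) -> list[int]:
    """Step 3 (IRR): report 1 w.p. q if the PRR bit is 1, and w.p. p if it is 0."""
    return [1 if rng.random() < (q if b else p) else 0 for b in prr]


rng = random.Random(1)                       # fixed seed, for reproducibility only
B = bloom_bits("rabbit", k=32, h=2)
B_prime = permanent_rr(B, f=0.5, rng=rng)    # computed once and stored by the client
X_prime = instantaneous_rr(B_prime, p=0.25, q=0.75, rng=rng)   # fresh for every report
```

In this sketch the client would memoize `B_prime` per value and regenerate only `X_prime` on each collection, mirroring the distinction between the PRR and the IRR.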

The RAPPOR encoding scheme satisfies two different 
$\epsilon$-differential privacy guarantees: one against a one-shot 
adversary who sees only a single IRR, and one against a stronger 
adversary who sees infinitely many IRRs over time. The latter 
adversary is able to reconstruct $B'$ with arbitrary precision 
after seeing enough reports, which motivates the need for a 
PRR, but is unable to infer $B$ from a single copy of $B'$. In 
principle, users could always report $B'$ at every data collection, 
but this would create a unique tracking identifier. 

Fig. 1. Visualization of the RAPPOR mechanisms. Each user starts by hashing her true string into a Bloom filter $B$. This representation $B$ is used 
to generate a permanent randomized response (PRR) $B'$. Each time the aggregator collects data (this might happen daily, for instance), the user builds a new 
instantaneous randomized response $X'$ from $B'$ and sends it to the aggregator. 

If the set of possible strings is small and known prior to 
collection (e.g., country, gender, etc.), a simplified version of 
the algorithm, called Basic RAPPOR, is more appropriate. The 
single difference is that in step (1), Basic RAPPOR does not 
make use of Bloom filters, but deterministically assigns each 
string to its own bit ($h = 1$). In this case, the size of $B$ is 
determined by the cardinality of the set being collected. This 
also significantly simplifies the inference process to estimate 
string frequencies by the aggregator. 

Despite strong report-level differential privacy guarantees, 
RAPPOR can approximate the marginal distribution of the 
measured variable(s) with high precision. One high-utility 
decoding scheme is described in [9], but the details of marginal 
decoding are not critical to understanding our present work. 


III. Estimating Joint Distributions 


Learning the distribution of a single variable is sometimes 
enough. For example, if the aggregator’s goal is to learn the 
100 most popular URLs visited by clients, then a 
straightforward application of the RAPPOR algorithm described in 
Section II suffices. 

More often, however, aggregators may be interested in 
learning the associations and correlations between multiple 
variables, all collected in a privacy-preserving manner. For 
example, suppose we would like to understand the relationship 
between installed software and annoying advertisements, e.g., 
to detect the presence of so-called adware. To do so, we 
might study the association between displayed advertisements 
and recently-installed software applications or extensions. If 
both of these variables are measured using the RAPPOR 
mechanism, the current literature does not describe how to 
estimate their joint distribution, although methods exist for 
estimating marginal frequencies of both variables individually. 

In this section, we describe a general approach to 
estimating the joint distribution of two or more RAPPOR-collected 
variables. Inference is performed using the 
expectation-maximization (EM) algorithm, which produces asymptotically 
unbiased estimates of joint probabilities. These joint estimation 
techniques will play a key role in Section IV where we estimate 
data distributions over unknown dictionaries. 


A. Estimating Joint Distributions with the EM Algorithm 

The EM algorithm is a common way to obtain maximum 
likelihood estimates (MLEs) in the presence of missing or 
incomplete data. It is particularly suited to RAPPOR 
applications where true values are not observed (missing) and only 
their noisy representations are being collected. 

For the sake of clarity, we will focus on estimating the joint 
distribution of two random variables $X$ and $Y$, both collected 
using Basic RAPPOR introduced in Section II. Extending this 
estimation to general RAPPOR requires careful consideration 
of unknown categories and will be discussed in the next 
section. Let $X' = \mathrm{RAPPOR}(X)$ and $Y' = \mathrm{RAPPOR}(Y)$ be 
the noisy representations of $X$ and $Y$ created by RAPPOR. 
Suppose that $N$ pairs of $X'$ and $Y'$ are collected from $N$ 
distinct (independent) clients. 

For brevity, let $X_i$ and $Y_j$ denote the events that $X = x_i$ 
and $Y = y_j$, respectively. The conditional probability of true 
values $X$ and $Y$, given the observed noisy representations $X'$ 
and $Y'$, is just a consequence of Bayes’ theorem: 

$$P(X = x_i, Y = y_j \mid X', Y') = \frac{p_{ij}\, P(X', Y' \mid X_i, Y_j)}{\sum_{k=1}^{m} \sum_{\ell=1}^{n} p_{k\ell}\, P(X', Y' \mid X_k, Y_\ell)},$$ 


where $m$ and $n$ are the number of categories in $X$ and $Y$, 
respectively. Here, $p_{ij}$ is the true joint distribution of $X$ and $Y$; 
this is the quantity we wish to estimate for each combination 
of categories $i$ and $j$. $P(X', Y' \mid X, Y)$ is the joint probability 
of observing the two noisy outcomes given both true values. 
Because $X'$ and $Y'$ are conditionally independent given both 
$X$ and $Y$, $P(X', Y' \mid X, Y) = P(X' \mid X) P(Y' \mid Y)$. Since the 
noise added through RAPPOR is predictable and mechanical, 
it is easy to precisely describe these probabilities. Without loss 
of generality, assume that $X = x_1$. In Basic RAPPOR, $x_1$’s 
Bloom filter representation has a one in the first position and 
zeros elsewhere, so we have 

$$P(X' \mid X = x_1) = q^{x'_1} (1 - q)^{1 - x'_1} \times p^{x'_2} (1 - p)^{1 - x'_2} \times \ldots \times p^{x'_m} (1 - p)^{1 - x'_m}.$$ 


The EM algorithm proceeds as follows: 

1) Initialize $p^0_{ij} = \frac{1}{mn}$, $1 \le i \le m$, $1 \le j \le n$ (uniform 
distribution). 

2) Update $p^t_{ij}$ with 

$$p^{t+1}_{ij} = P(X = x_i, Y = y_j) = \frac{1}{N} \sum_{k=1}^{N} P(X = x_i, Y = y_j \mid X'_k, Y'_k),$$ 

where $P(X = x_i, Y = y_j \mid X'_k, Y'_k)$ is computed using 
the current estimates $p^t_{ij}$. 

3) Repeat step 2 until convergence, i.e., $\max_{ij} |p^{t+1}_{ij} - p^t_{ij}| < \delta$ 
for some small positive value $\delta$. 

This algorithm converges to the maximum likelihood 
estimates of $p_{ij}$, which are asymptotically unbiased. 
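The EM iteration above can be sketched in plain Python for a pair of Basic RAPPOR variables. This is an illustration under stated assumptions rather than our released implementation: `p_star` and `q_star` are assumed to be the effective per-bit probabilities of reporting a 1 when the underlying bit is 0 or 1, respectively, and the function names, tolerance, and iteration cap are hypothetical.

```python
def report_likelihood(report, category, p_star, q_star):
    """P(report | true category): bit `category` is 1 w.p. q_star, every other bit w.p. p_star."""
    prob = 1.0
    for s, bit in enumerate(report):
        set_prob = q_star if s == category else p_star
        prob *= set_prob if bit else 1.0 - set_prob
    return prob


def em_joint(x_reports, y_reports, m, n, p_star, q_star, tol=1e-6, max_iter=1000):
    """EM estimate of the m x n joint distribution from paired Basic RAPPOR reports."""
    # P(X'_k | X_i) * P(Y'_k | Y_j) for every report k; constant across iterations.
    lik = []
    for xr, yr in zip(x_reports, y_reports):
        lx = [report_likelihood(xr, i, p_star, q_star) for i in range(m)]
        ly = [report_likelihood(yr, j, p_star, q_star) for j in range(n)]
        lik.append([[lx[i] * ly[j] for j in range(n)] for i in range(m)])
    p = [[1.0 / (m * n)] * n for _ in range(m)]           # step 1: uniform start
    for _ in range(max_iter):
        new = [[0.0] * n for _ in range(m)]
        for cell_lik in lik:                               # step 2: average the posteriors
            denom = sum(p[i][j] * cell_lik[i][j] for i in range(m) for j in range(n))
            for i in range(m):
                for j in range(n):
                    new[i][j] += p[i][j] * cell_lik[i][j] / denom / len(lik)
        change = max(abs(new[i][j] - p[i][j]) for i in range(m) for j in range(n))
        p = new
        if change < tol:                                   # step 3: convergence check
            break
    return p


# Toy usage with four paired 2-bit reports (made-up data, for illustration only).
x_reports = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_reports = [[1, 0], [1, 0], [0, 1], [1, 0]]
joint = em_joint(x_reports, y_reports, m=2, n=2, p_star=0.25, q_star=0.75)
```

Because the per-report likelihoods do not depend on the current estimate, they are computed once up front; each iteration then costs on the order of $N \cdot m \cdot n$ operations.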

B. Handling the “Other” category 

In the EM initialization step, we assume that we know all 
$m$ categories of $X$ and all $n$ categories of $Y$. In practice, 
the aggregator is unlikely to know all the relevant categories, 
and must make choices about which categories to include. 
Operationally, the aggregator would perform marginal analyses 
on both X' and Y' separately, estimate the most frequent 
categories, and use them in the joint analysis. The remaining 
undiscovered categories, which we refer to as “Other”, cannot 
be simply omitted from the joint analysis because doing so 
leads to badly biased distribution estimates. In this section, 
we discuss how to handle this problem. 

Suppose one ran the marginal decoding analysis separately 
on $X'$ and $Y'$, thereby detecting $m$ and $n$ top categories, 
respectively, along with their corresponding marginal 
frequencies. Note that $m$ and $n$ now represent the detected numbers 
of categories instead of the true numbers of categories. The 
“Other” categories for $X$ and $Y$ may constitute a significant 
amount of probability mass (computed as $1 - \sum_{i=1}^{m} p_i$ and 
$1 - \sum_{j=1}^{n} p_j$, respectively) which must be taken into account 
when estimating the joint distribution. 

The difficulty of modeling the “Other” category comes from 
the apparent problem of estimating 

$$P(X' = x' \mid X = \text{“Other”}), \qquad (1)$$ 

i.e., the probability of observing a report $x'$ given that it was 
generated by any category other than the top $m$ categories of 
$X$. However, if we could estimate this probability we could 
simply use the EM algorithm in its current form to estimate 
the joint distribution: an $(m + 1) \times (n + 1)$ contingency table 
in which the last row and the last column are the “Other” 
categories for each variable. 

We use knowledge of the top $m$ categories and their 
frequencies to estimate the probability in (1). Let $c_s^m$ be the 
expected number of times that reported bit $s$ was set by one 
of the top $m$ categories in $X$. It is equal to 

$$c_s^m = \left(\left(1 - \tfrac{f}{2}\right) q + \tfrac{f}{2}\, p\right) T_s + \left(\left(1 - \tfrac{f}{2}\right) p + \tfrac{f}{2}\, q\right) (N - T_s),$$ 

where $N$ is the number of reports collected and 

$$T_s = N \sum_{i=1}^{m} p_i\, I(B_s(x_i) = 1)$$ 

represents the expected number of times the $s$th bit in $N$ 
Bloom filters was set by a string from one of the top $m$ 
categories. Here, $I$ is the indicator function returning 1 or 0 
depending on whether the condition is true or not, $B(x_i)$ is the Bloom 
filter generated by string $x_i$, and $p_i$ is the true frequency of 
string $x_i$. 

Given the above, the estimated proportion of times each bit 
was set by a string from the “Other” category is then 

$$\hat{p}^o_s = \frac{c_s - c_s^m}{N \left(1 - \sum_{i=1}^{m} p_i\right)},$$ 

where $c_s$ is the observed number of times bit $s$ was set in all 
$N$ reports. 

Then, the conditional probability of observing any report 
$X'$ given that the true value was “Other” is given by 

$$P(X' = x' \mid X = \text{“Other”}) = \prod_{s=1}^{k} (\hat{p}^o_s)^{x'_s} (1 - \hat{p}^o_s)^{1 - x'_s},$$ 

where $k$ is the number of bits in a report. 

As stated earlier, we can use this estimate to run the EM 
algorithm with “Other” categories, thereby obtaining unbiased 
estimates of the joint distribution. 
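As a small illustration of how the “Other” row or column enters the EM computation, the sketch below evaluates the per-bit product for one report. It assumes the per-bit “Other” rates have already been estimated as described in this subsection; the function name and the example rates are hypothetical.

```python
from itertools import product


def other_likelihood(report, other_bit_rates):
    """P(X' = report | X = "Other") as a product of independent per-bit Bernoulli terms."""
    prob = 1.0
    for bit, rate in zip(report, other_bit_rates):
        prob *= rate if bit else 1.0 - rate
    return prob


rates = [0.5, 0.25, 0.1]   # made-up per-bit rates for a 3-bit report
# Sanity check: the likelihoods of all 2^3 possible reports form a distribution.
total = sum(other_likelihood(r, rates) for r in product([0, 1], repeat=3))
```

In the EM step, this value would simply take the place of the Basic RAPPOR likelihood whenever the candidate category is “Other”.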

C. Estimating the Variance-Covariance matrix 

It is a well-known fact in statistics that the asymptotic 
distribution of the maximum likelihood estimates $(\hat{p}_{11}, \ldots, \hat{p}_{mn})$ 
is 

$$\mathcal{N}\left((p_{11}, \ldots, p_{mn}),\, I^{-1}\right),$$ 

where $\mathcal{N}(\mu, \Sigma)$ stands for a Gaussian distribution with mean 
$\mu$ and variance-covariance matrix $\Sigma$ and $I$ is the information 
matrix defined below. A good estimate of $\Sigma$ is 
critical, as it allows us to assess how certain we are about our 
estimates of $p_{ij}$’s. It permits an aggregator to construct 95% 
confidence intervals, rigorously test if any of the proportions 
are different from 0, or perform an overall test for the 
association between $X$ and $Y$. 

In this case, the asymptotic variance-covariance matrix is 
given by the inverse of the incomplete-data observed information 
matrix $I_{obs}$. To obtain an estimate of the information matrix, 
we would evaluate the second derivative of the observed-data 
log-likelihood function at our MLE estimates $\hat{p}_{ij}$’s. 

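The log-likelihood function is the log of the probability of 
observing all $N$ reports, treated as a function of the unknown 
parameter vector $(p_{11}, \ldots, p_{mn})$: 

$$l(p_{11}, \ldots, p_{mn}) = \sum_{k=1}^{N} \log \left( \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij}\, P(X'_k, Y'_k \mid X_i, Y_j) \right).$$ 

The first derivative with respect to $p_{ij}$ is given by 

$$\frac{\partial l}{\partial p_{ij}} = \sum_{k=1}^{N} \frac{P(X'_k, Y'_k \mid X = x_i, Y = y_j)}{\sum_{o=1}^{m} \sum_{\ell=1}^{n} p_{o\ell}\, P(X'_k, Y'_k \mid X = x_o, Y = y_\ell)}.$$ 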




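The second derivative, taken with a negative sign, gives the 
observed information matrix (size $mn \times mn$). Differentiating the 
first derivative once more, with respect to $p_{st}$, its entries are 

$$I_{obs}[(i,j),(s,t)] = -\frac{\partial^2 l}{\partial p_{ij}\, \partial p_{st}} = \sum_{k=1}^{N} \frac{P(X'_k, Y'_k \mid X = x_i, Y = y_j)\, P(X'_k, Y'_k \mid X = x_s, Y = y_t)}{\left( \sum_{o=1}^{m} \sum_{\ell=1}^{n} p_{o\ell}\, P(X'_k, Y'_k \mid X = x_o, Y = y_\ell) \right)^2}.$$ 

Inverting this matrix and evaluating at the current MLE 
estimates $\hat{p}_{11}, \ldots, \hat{p}_{mn}$ results in an estimate of the 
variance-covariance matrix $\Sigma$. The $mn$ diagonal elements of $\hat{\Sigma}$ contain 
the variance estimates for each $\hat{p}_{ij}$ and can be directly used 
to assess how certain we are about them. 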
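A minimal sketch of this computation follows. It accumulates the negative second derivative of the log-likelihood, obtained by differentiating the first derivative once more, and inverts it with a small Gauss-Jordan routine. As in the text, all $mn$ cell probabilities are treated as free parameters; the flattened cell indexing, the function names, and the singularity threshold are our assumptions.

```python
def observed_information(lik, p):
    """Negative Hessian of the log-likelihood.

    lik[k][c] is P(X'_k, Y'_k | cell c) and p[c] is the MLE of cell c,
    with the m x n cells flattened into a single index c.
    """
    d = len(p)
    info = [[0.0] * d for _ in range(d)]
    for cell_lik in lik:
        denom = sum(pc * lc for pc, lc in zip(p, cell_lik))
        for a in range(d):
            for b in range(d):
                info[a][b] += cell_lik[a] * cell_lik[b] / denom ** 2
    return info


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting; raises ValueError if singular."""
    d = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("information matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


# Toy usage: two noise-free reports, one per cell (made-up data).
info = observed_information([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
sigma_hat = invert(info)                                  # estimated variance-covariance matrix
variances = [sigma_hat[c][c] for c in range(len(sigma_hat))]
```

The diagonal of `sigma_hat` would then supply the variance used for a confidence interval around each estimated cell.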

D. Testing for Association 

When dealing with two or more categorical variables, one 
of the first questions generally asked is whether they are 
independent of each other. For two variables to be independent 
their joint distribution must be equal to the product of their 
marginals, i.e. 

$$P(X, Y) = P(X \mid Y) P(Y) = P(Y \mid X) P(X) = P(X) P(Y).$$ 

In practice, this means that knowing the value of $X$ provides 
no valuable information when predicting $Y$. In this section, we 
explain why the most common statistical test of independence, 
the $\chi^2$ test, is not appropriate for RAPPOR-collected 
variables and propose an alternative test statistic. 

The $\chi^2$ test is one of the most widely used statistical 
techniques for testing the independence of two or more categorical 
variables. It proceeds by comparing the observed cell counts 
to what is expected under the independence assumption. The 
formal test statistic is given by 

$$\chi^2 = \sum_{i=1}^{mn} \frac{(O_i - E_i)^2}{E_i},$$ 

where $E_i$ is the expected cell count under the 
independence assumption and $O_i$ is the observed cell 
count. This test statistic has a known distribution under the 
assumption that $X$ and $Y$ are independent, and it is a $\chi^2$ 
distribution with $(m - 1)(n - 1)$ degrees of freedom. 

Unfortunately, we cannot use the $\chi^2$ test statistic because 
we do not observe exact cell counts $O_i$ of the co-occurrence 
of our random variables $X$ and $Y$. Instead, we have mean 
estimates and the corresponding variance-covariance matrix 
obtained through the EM algorithm. 

Weighted quadratic forms of multivariate normal inputs are 
well-behaved with tractable distribution properties. Let 

$$T = (\hat{p} - \hat{\mu})^{\top}\, \hat{\Sigma}^{-1} (\hat{p} - \hat{\mu}),$$ 

where $\hat{p}$ is a vector of $\hat{p}_{ij}$’s, $\hat{\mu}$ is a vector of products of 
marginals (i.e., the expected joint distribution if the variables 
are independent) and $\hat{\Sigma}$ is the estimated variance-covariance 
matrix. Here, $\top$ indicates the transpose operation. Under the 
null hypothesis of no association, this test statistic $T$ has a $\chi^2$ 
distribution with $(m - 1)(n - 1)$ degrees of freedom, similarly 
to the $\chi^2$ test. 
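The statistic $T$ can be computed without forming the inverse explicitly, by solving one linear system. The sketch below is an illustration only: the row-major flattening of the $m \times n$ table, the function names, and the small elimination routine are our assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    d = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("variance-covariance matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    x = [0.0] * d
    for r in reversed(range(d)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, d))
        x[r] = (aug[r][d] - tail) / aug[r][r]
    return x


def association_statistic(p_hat, sigma_hat, m, n):
    """T = (p - mu)^T Sigma^{-1} (p - mu), with mu the product of the estimated marginals.

    p_hat is the m x n joint estimate flattened row by row; sigma_hat is (mn) x (mn).
    """
    row = [sum(p_hat[i * n + j] for j in range(n)) for i in range(m)]
    col = [sum(p_hat[i * n + j] for i in range(m)) for j in range(n)]
    diff = [p_hat[i * n + j] - row[i] * col[j] for i in range(m) for j in range(n)]
    return sum(d * w for d, w in zip(diff, solve(sigma_hat, diff)))


# Toy usage with an identity covariance matrix (made-up numbers).
identity = [[1.0 if a == b else 0.0 for b in range(4)] for a in range(4)]
t_dependent = association_statistic([0.5, 0.0, 0.0, 0.5], identity, 2, 2)
t_independent = association_statistic([0.25, 0.25, 0.25, 0.25], identity, 2, 2)
```

The resulting value would then be compared with a critical value of the $\chi^2$ distribution with $(m - 1)(n - 1)$ degrees of freedom, taken from a table or from a routine such as SciPy's `scipy.stats.chi2.ppf` (not used here, to keep the sketch dependency-free).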

In summary, to perform a formal statistical test for 
independence between $X$ and $Y$, one would use the EM algorithm 
to estimate the joint distribution along with the 
variance-covariance matrix. Then, one would compute the $T$ test 
statistic and compare it to the corresponding critical quantile 
$q_{1-\alpha}$ from the $\chi^2_{(m-1)(n-1)}$ distribution. One would conclude that $X$ and 
$Y$ are not independent if $T > q_{1-\alpha}$ and state that there is 
no evidence for non-independence otherwise. We demonstrate 
numerically that this proposed test statistic has the expected 
behavior in the Appendix. 

TABLE I 

True joint distribution of X (in rows) and Y (in columns). Values are percentages. 

|       | 1     | 2     | 3     | 4     | 5     | Other  |
|-------|-------|-------|-------|-------|-------|--------|
| 1     | 3.567 | 2.937 | 2.468 | 1.952 | 1.639 | 6.436  |
| 2     | 2.984 | 2.432 | 1.967 | 1.581 | 1.289 | 5.362  |
| 3     | 2.473 | 1.991 | 1.609 | 1.223 | 1.025 | 4.343  |
| 4     | 1.881 | 1.569 | 1.293 | 1.069 | 0.874 | 3.499  |
| 5     | 1.625 | 1.292 | 1.080 | 0.892 | 0.662 | 2.836  |
| Other | 6.380 | 5.292 | 4.311 | 3.495 | 2.809 | 11.863 |

TABLE II 

Estimated joint distribution of X (in rows) and Y (in columns). Values are percentages. 

|       | 1           | 2           | 3           | 4           | 5           | Other       |
|-------|-------------|-------------|-------------|-------------|-------------|-------------|
| 1     | (illegible) | (illegible) | (illegible) | (illegible) | (illegible) | (illegible) |
| 2     | 3.045       | 2.292       | 2.043       | 1.588       | 1.286       | 5.302       |
| 3     | 2.336       | 2.173       | 1.587       | 1.115       | 0.916       | 4.450       |
| 4     | 1.902       | 1.506       | 1.354       | 1.087       | 0.887       | 3.510       |
| 5     | 1.763       | 1.233       | 1.188       | 0.873       | 0.615       | 2.801       |
| Other | 6.531       | 5.338       | 4.245       | 3.513       | 2.916       | 11.419      |

E. Simulation Results 

To illustrate our multivariable analysis of differentially 
private data, we generated synthetic RAPPOR reports for 
variables X and Y, each with 100 unique categories. The 
marginal distributions of X and Y were discretized Zipfian 
distributions, and their (truncated) joint distribution is given 
in Table I. 

With 100,000 reports we were able to estimate the 
frequencies of 15 top categories for each of $X$ and $Y$, on average. For 
the purpose of performing the association analysis, we selected 
the top five categories from each variable as indicated by the 
estimated marginal distribution. First we ignored the “Other” 
categories completely and assumed that $X$ and $Y$ had 5 unique 
values each. 10 Monte Carlo replications were performed; the 
first two panels of Figure 2 plot the estimated cell frequency 
against the true cell frequency for each of the ten trials and 
25 cells. As expected, the estimated 25 proportions are poor 
estimates for both the true joint frequencies and the conditional 
frequencies $P(X = x, Y = y \mid X \in \text{top-5}, Y \in \text{top-5})$. In fact, 
for the conditional probabilities, there is a regression-to-the-mean 
effect where high values are underestimated and low 
values are overestimated. 

The bottom two panels of Figure 2 show estimates when 
we account for the “Other” categories of both $X$ and $Y$. The 
estimated joint distribution is now a $6 \times 6$ table, and the 
procedure produces unbiased estimates for the true joint frequencies 
(Table II). Accordingly, it also produces 25 unbiased estimates 
for conditional frequencies. 
Fig. 2. Sample size 100,000. True vs. estimated joint frequencies. Red dots show average estimates over 10 Monte Carlo runs. Grey dots show individual 
estimates. Two panels at the top show how ignoring the “Other” category leads to biased estimates for both joint and conditional distributions. For the 
conditional distribution, there’s a regression to the mean effect where high values are underestimated and lower values are overestimated. Accounting for the 
“Other” category fixes the problem and the estimates are close to the truth. 


F. Real-World Example: Google Play Store Apps 

To demonstrate these techniques in a real example, we 
downloaded the public metadata for 200,000 mobile-phone 
apps from the Google Play Store. For each of these apps, 
we obtained the app category (30 categories) and whether it is 
offered for free or not. This information can be summarized in 
a $30 \times 2$ contingency table. Applying a $\chi^2$ independence test to 
this contingency table would test whether different categories 
are statistically more likely to feature free apps. In this section, 
we use RAPPOR and our joint decoding approach to learn 
this distribution without direct access to the underlying data 
points. The dataset in this example is not particularly sensitive; 
however, we were unable to find public datasets of sensitive, 
multivariate, categorical data, precisely due to the associated 
privacy concerns. 

For each sampled app, we generated a simulated Basic 
RAPPOR report for both variables: app category and payment 
model. We used 30-bit reports for the category variables, and 
1-bit reports for the Boolean payment model. 

We then performed a joint distribution analysis by 
estimating the $30 \times 2$ contingency table, i.e., the frequency of each 
combination of item category and payment model. Results 
are shown in the second panel of the accompanying figure. The green points 
show both true and estimated frequencies of free items for 
each category, while the brown points show the paid ones. 
Note that these are the 60 cell frequencies from the true and 
estimated contingency tables, not proportions of free or paid 
apps for each category. 95% confidence intervals are shown 
as horizontal bars for both sets of estimates and have proper 
coverage in all cases. 

The top panel of the figure shows the true and estimated 
paid rate for each category, computed as the proportion of 
paid apps for that category divided by the overall proportion 
of a category. This ratio estimate is less stable than the joint 
frequencies but follows the true rates closely for most app 
categories. 

We perform a formal test for independence by computing 
the proposed $\chi^2$-test statistic $T = 107.093$, which has a 
p-value of $6.9523 \times 10^{-11}$. This is much smaller than 0.05 and we 
would therefore conclude that there are, in fact, statistically 
significant differences in paid rates between different app 
categories. This can be, of course, clearly seen from the 
top panel where categories are ordered in the descending 
prevalence of paid software, with proportions ranging from 
30% to 4%. 
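As a worked check of this decision against the critical-quantile rule, note that a $30 \times 2$ table has $(30 - 1)(2 - 1) = 29$ degrees of freedom. The critical value below is taken from a standard $\chi^2$ table at $\alpha = 0.05$ and is not a number reported in our experiments:

$$q_{0.95} \approx 42.56 \quad \text{for } \chi^2_{29}, \qquad T = 107.093 > q_{0.95},$$

so the null hypothesis of independence is rejected at the 5% level, in agreement with the p-value.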

IV. RAPPOR Without a Known Dictionary 

Suppose we wish to use RAPPOR to learn the ten most 
visited URLs last week. To do this, we could first create an 
exhaustive list of candidate URLs, and then test each candidate 
against received reports to determine which ones are present 
in the sampled population. In this process, it is critical to 
include all potential candidates, since RAPPOR has no direct 
feedback mechanism for learning about missed candidates. 
Such a candidate list may or may not be available, depending 
on what is being collected. For instance, it may be easy to 
guess the most visited URLs, but if we instead wish to learn 
the most common tags in private photo albums, it would be 
impractical to construct a fully exhaustive list. In this section, 
we describe how to learn distribution-level information about 
a population without knowing the dictionary, i.e., the set of 
candidate strings, beforehand. 

To enable the measurement of unknown strings, more 
information needs to be collected from clients. In addition to 
collecting a regular RAPPOR report of the client’s full string, 
we will collect RAPPOR reports generated from n-grams* 
that are randomly selected from the string. The key idea is 
to use co-occurrences among n-grams to construct a set of 
full-length candidate strings. To analyze these co-occurrences, 
we use the joint distribution estimation algorithm developed in 
the previous section. Once we build a dictionary of candidate 
strings, we can perform regular, marginal RAPPOR analysis 
on the full-string reports to estimate the distribution. In the 
extreme case, if our n-grams were as long as the string itself, 
we would be searching for candidates over the space of every 
possible string of a given length. By using small n-grams (2 or 
3 characters long), we can significantly reduce the associated 
computational load, without compromising accuracy. 

Concretely, a client reporting string $x$ with local differential 
privacy budget $\epsilon$ would create a report 

$$X' = \mathrm{RAPPOR}(x)$$ 

by spending a third of her privacy budget (i.e., using 
differential privacy level $\epsilon/3$). The other two thirds of $\epsilon$ would be 
spent equally on collecting two n-grams 

$$G'_1 = \mathrm{RAPPOR}(\text{n-gram}(x, g_1))$$ 

and 

$$G'_2 = \mathrm{RAPPOR}(\text{n-gram}(x, g_2))$$ 

at distinct random positions $g_1$ and $g_2$, where $\text{n-gram}(x, g_i)$ 
denotes the length-$n$ string starting at the $g_i$th character. In 
principle, the only limitations on $g_1$ and $g_2$ are that $g_1 \ne g_2$ 
and $g_1, g_2 < M - n$; this means that one could choose 
partially overlapping n-grams. In our simulations, we impose 
the condition that $M$ is divisible by both $g_1$ and $g_2$, meaning 
that we partition the string into adjacent, non-overlapping 
n-grams. For instance, if our strings have at most $M = 6$ 
characters and our n-grams are two characters each, then 
there are only 3 n-grams per string; $g_1$ and $g_2$ are therefore 
drawn without replacement from the set $\{1, 2, 3\}$. In the 
original RAPPOR paper, each client would report a single 
randomized bit array $X'$. Our proposed augmented collection 
would instead report $\{X', G'_1, G'_2, g_1, g_2\}$, where both $g_1$ and 
$g_2$, the two n-gram positions, are sent in the clear. 

*An n-gram is an n-character substring. 

To prevent leakage of information through the length of
the string $x$, the aggregator should specify a maximum string
length M (divisible by the size of n-grams) and pad all strings
shorter than M with empty spaces. Strings longer than M
characters would be truncated and hashed to create $X'$, and
only M/n distinct n-grams would be sampled. Information
in the tail of strings longer than M would be permanently
lost and can only be recovered by increasing M. There are
interesting trade-offs involved in the selection of M, which
should become clear after we describe the decoding algorithm.
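The client-side preparation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, positions are 1-indexed block indices as in the M = 6 example, and the RAPPOR encoding of each field is deliberately left out.

```python
import random


def pad_or_truncate(s: str, max_len: int) -> str:
    """Pad with spaces, or truncate, so the string has exactly max_len characters."""
    return s[:max_len].ljust(max_len)


def sample_ngrams(s: str, max_len: int, n: int, rng: random.Random):
    """Pick two distinct block positions and return (position, n-gram) pairs."""
    if max_len % n != 0:
        raise ValueError("max_len must be divisible by n")
    padded = pad_or_truncate(s, max_len)
    # Draw two of the max_len / n non-overlapping blocks without replacement.
    positions = sorted(rng.sample(range(1, max_len // n + 1), 2))
    return [(g, padded[(g - 1) * n : g * n]) for g in positions]


def build_report_inputs(s: str, max_len: int, n: int, epsilon: float, rng: random.Random):
    """Collect the three values a client would encode, each at epsilon / 3."""
    (g1, gram1), (g2, gram2) = sample_ngrams(s, max_len, n, rng)
    return {
        "full_string": pad_or_truncate(s, max_len),  # encoded as X'
        "ngrams": [gram1, gram2],                    # encoded as G'_1 and G'_2
        "positions": [g1, g2],                       # sent in the clear
        "epsilon_per_report": epsilon / 3.0,
    }
```

Sorting the two positions is a convenience for later grouping by position pair; it is not something the text requires.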

Note that there is nothing wrong with using overlapping
n-grams; for a fixed number of sampled n-grams, it increases
redundancy at the expense of coverage, much like the use
of overlapping windows in spectral signal analysis. Similarly,
there is nothing theoretically wrong with measuring more than
two n-grams. However, this would force each n-gram to use
privacy level $\epsilon/(r+1)$, where r is the number of n-grams
measured; this forces the client to send more data to achieve
the same fidelity on a per-n-gram basis. More problematically,
using larger numbers of n-grams can significantly increase
the complexity of estimating n-gram co-occurrences. We will
discuss these details momentarily, but just to give an example,
collecting 3 bigrams over the space of only letters requires
us to estimate a distribution over a sample space with $(26^2)^3$
possibilities. For this reason, we do not provide simulation
results based on collecting more than two n-grams.
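To make the size of that sample space concrete, the count can be expanded and compared with the two-bigram case (assuming, as in the text, a 26-letter alphabet):

```latex
\[
(26^2)^3 = 676^3 = 26^6 = 308{,}915{,}776 \approx 3.1 \times 10^{8},
\qquad
(26^2)^2 = 676^2 = 456{,}976 \approx 4.6 \times 10^{5}.
\]
```

Adding a third bigram therefore multiplies the size of the joint sample space by another factor of 676.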

A. Building the Candidate Set 

Fig. 3. Estimating the joint distribution of the categories of software items, and whether they are free or for purchase. Categories are ordered by the paid
fraction, shown in the top panel. The bottom panel plots 60 true and estimated joint frequencies along with 95% confidence intervals shown as horizontal
bars. Both true and estimated frequencies (free + paid) add up to 1.

Let N be the number of clients participating in the collection.
The aggregator’s reconstruction algorithm proceeds as
follows:

1) Build n-gram dictionary: Start by building a subdictionary
of every possible n-gram. If the alphabet D has |D|
elements in it, this subdictionary will have $|D|^n$ elements.
An example alphabet is D = {0-9, a-z}.

2) Marginal preprocessing: Take the set of all reports
generated from n-grams, $\{(G'_1)_i, (G'_2)_i\}_{i=1}^{N}$. Split this
set into mutually exclusive groups based on the position
from which they were sampled. There will be M/n such
groups.

3) Marginal decoding: For each position group, perform
marginal analysis to estimate which n-grams are common
at each position and their corresponding frequencies.
This step uses the n-gram dictionary constructed
in (1).

4) Joint preprocessing: Each pair of n-grams falls into
one of $\binom{M/n}{2}$ groups, defined by the randomly-chosen
positions of the two n-grams, $g_1$ and $g_2$. Split the reports
into these groups.

5) Joint analysis: Perform separate joint distribution analyses
for each group in (4) using the significant n-grams
discovered in (3).

6) n-gram candidates: Select all n-gram pairs with frequency
greater than some threshold $\delta$.

7) String candidates (Graph-building): Construct a graph
with edges specified by the previously-selected n-gram
pairs. Analyze the graph to select all M/n-node fully
connected subgraphs which form a candidate set C.
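The joint preprocessing of step (4) amounts to bucketing each client's pair of n-gram reports by its position pair. A minimal sketch follows, assuming each record is a tuple `(g1, g2, report1, report2)` with 1-indexed block positions; the record layout and function names are illustrative, not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def position_pairs(max_len: int, n: int):
    """All unordered pairs of n-gram positions for strings of length max_len."""
    k = max_len // n
    return list(combinations(range(1, k + 1), 2))


def group_by_position_pair(records):
    """Bucket (g1, g2, report1, report2) records by their sorted position pair."""
    groups = defaultdict(list)
    for g1, g2, report1, report2 in records:
        if g1 > g2:
            # Normalize the order so (3, 1) and (1, 3) land in the same group.
            g1, g2, report1, report2 = g2, g1, report2, report1
        groups[(g1, g2)].append((report1, report2))
    return dict(groups)


print(position_pairs(6, 2))
# → [(1, 2), (1, 3), (2, 3)]
```

For M = 20 and bigrams the same function yields 45 position pairs, each of which needs its own joint analysis in step (5).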

Steps (3)-(7) are illustrated in Figure 4, but steps (6) and
(7) require some more explanation. For simplicity, assume that
M = 6 and that we are collecting two bigrams from each
client. For string x with frequency $f(x)$, there could only be
three different combinations of bigram pairs reported by each
client: $(g_1, g_2) \in \{(1,2), (1,3), (2,3)\}$. If string x is a true
candidate, then we would expect the corresponding bigrams
from all three pairings to have frequency of at least $f(x)$
in the relevant joint distributions. Additional frequency could
come from other strings in the dictionary that share the same
bigrams. In general, all n-gram pairs must have frequency
greater than some threshold $\delta$ to produce a valid candidate.
We computed $\delta$ as

$\delta = \sqrt{\frac{p_2 (1 - p_2)}{(q_2 - p_2)^2 N}}$,

where

$q_2 = 0.5 f (p + q) + (1 - f) q$

and

$p_2 = 0.5 f (p + q) + (1 - f) p$.

This expression is designed to ensure that if an n-gram pair
has no statistical correlation, then with high probability its
estimated probability will fall below $\delta$. Indeed, $1.645\delta$ is a
frequency threshold above which we expect to be able to
distinguish strings from noise in our marginal analysis. We
deliberately use a slightly lower threshold to reduce our false
negative rate.
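As a numerical check on the threshold, the sketch below evaluates it under the assumption that $\delta$ is the standard deviation of a noise-only frequency estimate, i.e. $\sqrt{p_2(1-p_2) / ((q_2-p_2)^2 N)}$. The function names are illustrative. With the per-report parameters quoted for the URL experiment (p = 0.25, q = 0.32, f = 0, one million reports), this form reproduces the cutoff of 0.0062 reported there.

```python
import math


def effective_probs(p: float, q: float, f: float):
    """Effective bit-report probabilities p_2 and q_2 after permanent randomization."""
    p2 = 0.5 * f * (p + q) + (1 - f) * p
    q2 = 0.5 * f * (p + q) + (1 - f) * q
    return p2, q2


def pair_threshold(p: float, q: float, f: float, num_reports: int) -> float:
    """Assumed form of the n-gram pair threshold delta (see lead-in)."""
    p2, q2 = effective_probs(p, q, f)
    return math.sqrt(p2 * (1 - p2) / ((q2 - p2) ** 2 * num_reports))


delta = pair_threshold(p=0.25, q=0.32, f=0.0, num_reports=1_000_000)
print(round(delta, 4))
# → 0.0062
```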

Fig. 4. Process for learning the distribution of a random variable without knowing the dictionary ahead of time. The aggregator computes pairwise joint
distributions from the noisy reports generated at different n-gram positions. These pairwise joint distributions are used to generate a candidate string dictionary.

Step (7) is explained in greater detail in Figure 5. The basic
idea is to construct a set of candidate strings by building a
graph and finding fully-connected cliques. Each n-gram at
each position is treated as a distinct node. Edges are drawn
between every valid n-gram pair from step (6). These edges
may be due to true signal (solid lines) or noise (dotted lines),
but the aggregator has no way of distinguishing a priori.
Regardless of provenance, edges are only drawn between
n-grams of different positions, so the resulting graph is
k-partite, where k = M/n. Now the task simplifies to finding
every fully-connected k-clique in this k-partite graph; each
clique corresponds to a candidate string. This works for the
following reason: If a string x is truly represented in the
underlying distribution, then the likelihood of any of its n-gram
pairs having a joint frequency below the threshold $\delta$ is small.
Therefore, if even a single n-gram pair from string x has a
significantly lower frequency than $\delta$ after accounting for the
noise introduced by RAPPOR, then x is most likely a false
positive. Accordingly, the corresponding edge will be missing
in the graph, and our clique-finding approach will discard x
as a candidate string.
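The clique search can be illustrated with a simple position-by-position extension of partial strings. This is a plain sketch, not the specialized k-partite algorithm referred to in the text, and the edge set below is an assumption chosen to reproduce the candidate set given in the Fig. 5 caption.

```python
def find_candidate_strings(nodes_by_position, edges):
    """Enumerate k-cliques of a k-partite n-gram graph as candidate strings.

    nodes_by_position[i] lists the n-grams retained at position i;
    edges is a set of ((pos_a, gram_a), (pos_b, gram_b)) pairs.
    """
    def connected(a, b):
        return (a, b) in edges or (b, a) in edges

    partial = [[(0, gram)] for gram in nodes_by_position[0]]
    for pos in range(1, len(nodes_by_position)):
        extended = []
        for clique in partial:
            for gram in nodes_by_position[pos]:
                node = (pos, gram)
                # Keep the node only if it is linked to every earlier position.
                if all(connected(prev, node) for prev in clique):
                    extended.append(clique + [node])
        partial = extended
    return {"".join(gram for _, gram in clique) for clique in partial}


nodes = [["ra", "he"], ["bb", "rm"], ["it"]]
edges = {
    ((0, "ra"), (1, "bb")), ((0, "ra"), (2, "it")), ((1, "bb"), (2, "it")),  # rabbit
    ((0, "he"), (1, "rm")), ((0, "he"), (2, "it")), ((1, "rm"), (2, "it")),  # hermit
    ((0, "he"), (1, "bb")),  # spurious edge caused by noise
}
print(sorted(find_candidate_strings(nodes, edges)))
# → ['hebbit', 'hermit', 'rabbit']
```

A single noisy edge between "he" and "bb" is enough to create the false candidate "hebbit", while "rarmit" is rejected because the edge between "ra" and "rm" is missing.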

If executed naively, this clique-finding step can become a
storage and computation bottleneck. In the worst case, the
number of candidates can grow exponentially in the number
of bigrams collected. The problem of efficiently finding
k-cliques in a k-partite graph has been studied in the context of
braiding in the textile industry; this approach significantly
outperforms traditional branch-and-bound algorithms.

Candidates in C can be further filtered based on string
semantics and/or limited prior knowledge. For example, if it
becomes apparent that what we are collecting are URLs, then
candidates that do not meet strict URL encoding restrictions
can be safely removed without further consideration (e.g.,
strings with spaces in the middle and so on).

[Figure: k-partite graph with node columns labeled Bigram 1, Bigram 2, Bigram 3]

Fig. 5. Graph building process for generating full string candidates. In the
graph-building phase, we search for fully connected cliques in this k-partite
graph. In this graph, the resulting set of candidate strings would be C =
{rabbit, hermit, hebbit}. The noisy false positive (“hebbit”) gets weeded out
by candidate testing (Section IV-B).

B. Testing Candidate Strings 

To estimate the marginal distribution of unknown strings,
we use the set of full string reports $X'_1, \ldots, X'_N$ and candidate
dictionary C to perform marginal inference as described in the
original RAPPOR paper. False positives in the candidate set C
will be weeded out in this step, because the marginal decoding
shows that these strings occur with negligible frequency.
The marginal analysis here differs from classical RAPPOR
marginal analysis in two important ways:

1) Reports $X'_1, \ldots, X'_N$ are collected with stronger privacy
guarantees by using privacy parameter $\epsilon/3$ as opposed
to $\epsilon$. Depending on the true distribution, this may or
may not affect the final results, but in general, there is a
substantial penalty for collecting additional information
in the form of two n-grams.

2) The estimated candidate set C is unlikely to be as
complete as an external knowledge-based set. With high
probability, it will include the most frequent (important)
candidates, but it will miss less frequent strings due to
privacy guarantees imposed on n-gram reporting. On
long-tailed distributions, this means that a significant
portion of distribution mass may fall below the noise
floor. Set C is also likely to contain many false-positive
candidates, which places a heavier load on statistical
testing: the tests must necessarily be more conservative
when a large number of them are performed.

The output of this step is the estimated marginal weights of
the most common strings in the dictionary.

V. Results 

We performed a series of simulation studies and one real- 
world example to empirically show the utility of the proposed 
approach. Before showing these results, we discuss why this 
scheme does not alter the privacy guarantees of original 
RAPPOR. 

A. Privacy 

Recall that we split the privacy budget evenly between
the n-grams and the full-string report. For instance, if we
collect reports on two n-grams and the full string, each
report will have privacy parameter $\epsilon/3$. It is straightforward to
show from the definition of local differential privacy that two
independent measurements of the same datapoint, each with
differential privacy parameter $\gamma$, will collectively have privacy
parameter $2\gamma$. Moreover, dependent measurements contain less
information than independent measurements, so the overall
privacy parameter is at most $2\gamma$. Applying the same argument
to all three reports, each collected at $\epsilon/3$, our n-gram
based measurement scheme provides the same privacy as a
single RAPPOR report with differential privacy $\epsilon$.
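The composition step can be written out explicitly. The following is a sketch for two independent mechanisms $A_1$ and $A_2$ (names introduced here for illustration), each satisfying $\gamma$-local differential privacy, evaluated on any two inputs $x$, $x'$ and any outputs $y_1$, $y_2$:

```latex
\[
\frac{\Pr[A_1(x) = y_1,\; A_2(x) = y_2]}{\Pr[A_1(x') = y_1,\; A_2(x') = y_2]}
= \frac{\Pr[A_1(x) = y_1]}{\Pr[A_1(x') = y_1]} \cdot
  \frac{\Pr[A_2(x) = y_2]}{\Pr[A_2(x') = y_2]}
\le e^{\gamma} \cdot e^{\gamma} = e^{2\gamma}.
\]
```

The equality uses independence of the two mechanisms. Repeating the argument for three reports at $\gamma = \epsilon/3$ gives a bound of $e^{3\gamma} = e^{\epsilon}$.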

Note also that local differential privacy guarantees hold
even when the aggregator has side information. For
instance, the aggregator might wish to study a distribution
of strings from a small (but unknown) dictionary of English
words; it might therefore have a prior distribution on bigrams
that appear in the English language. In this case, differential
privacy guarantees ensure that the aggregator cannot improve
its estimate (conditioned on the prior information) by more
than a factor of $e^{\epsilon}$.

B. Efficiency 

Recall that |D| is the size of our alphabet, and r is the
number of n-grams collected from each string. N denotes
the number of datapoints. The bottleneck of our algorithm
is constructing the dictionary of candidate strings. This can be
split into two phases: (a) computing n-gram co-occurrences,
and (b) building the candidate dictionary from a k-partite
graph of n-gram co-occurrences. Part (a) has complexity
$O(N |D|^{rn})$ due to the EM algorithm. Part (b) depends on
the size of the initial k-partite graph. If there are p nodes in
each of the partitions, this part has worst-case computational
complexity $O(k p^{k-1})$. However, due to the significant sparsity
in this k-partite graph, this complexity can be much lower in
practice.

These asymptotic costs can be prohibitive as the number of
data samples increases. This is partially because the EM
algorithm in phase (a) is iterative, and each iteration depends on
every data element; this can lead to high memory requirements
and lengthy runtimes. However, while the complexity of part
(a) dominates part (b) in most usage scenarios, part (a) can
also be parallelized more easily. We are therefore working on
releasing a parallelized version of the EM estimation code. We
will also show how the parameter $\delta$ can be tuned to reduce
the computational load of part (b) in exchange for a reduction
in accuracy.

C. Simulated Results 

To understand the impact of parameter choices on accuracy
and efficiency, we built a synthetic dataset comprised of
fake “hashes”: randomly selected character strings of a fixed
length. We then specified a distribution over 100 such strings;
in the following examples, that distribution is a discretized
Zipfian.* We drew 100,000 strings from this distribution, and
encoded them as 128-bit RAPPOR reports, with parameters
p = 0.25, q = 0.75, and f = 0. Based on this set of
noisy reports and our joint analysis, we estimated the marginal
distribution of these strings without using any prior knowledge
about them. We limited ourselves to 100,000 strings for the
sake of computational feasibility while exploring the parameter
space. However, in this section we will also show results from
a larger trial on a real dataset with 1,000,000 simulated clients.

*Data generated from other distributions are included in Appendix B.

Figure 6 illustrates that our method correctly estimates the
underlying distribution on average.

Fig. 6. Estimated distribution of hash strings, computed from the RAPPOR
reports of 100,000 simulated users. This plot only shows results for the 25
most frequent strings. We were able to correctly estimate the top 7 strings
without access to a prior dictionary. With the dictionary as prior information,
we are able to estimate 17 of the top strings.

To evaluate how close our estimate was to the true distribution,
we used the Hellinger distance, which captures the
distance between two distributions of discrete random variables.
For discrete probability distributions P and Q defined
over some set U, the Hellinger distance is defined as

$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i \in U} \left( \sqrt{P_i} - \sqrt{Q_i} \right)^2}$.

This metric is related to the Euclidean norm of the difference
between the square root vectors; we chose it in part because,
unlike Kullback-Leibler divergence, it is defined even when
the two distributions are nonzero over different sets.
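For reference, the standard Hellinger distance between two discrete distributions can be computed as below. This is a minimal sketch; representing each distribution as a dictionary from string to probability is an assumption made here so that strings missing from one distribution are treated as having probability zero.

```python
import math


def hellinger(p: dict, q: dict) -> float:
    """Hellinger distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(s, 0.0)) - math.sqrt(q.get(s, 0.0))) ** 2
        for s in support
    )
    return math.sqrt(total / 2.0)


true_dist = {"rabbit": 0.6, "hermit": 0.3, "badger": 0.1}
estimate = {"rabbit": 0.65, "hermit": 0.35}  # "badger" was never discovered
print(hellinger(true_dist, estimate))
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, so a string that is never discovered contributes its full probability mass to the discrepancy.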

Accuracy and n-gram length: Figure 7 plots the Hellinger
distance of our reconstructed distribution as a function of
string length, for different sizes of n-grams.* This figure suggests
that for a fixed string length, using larger n-grams gives
a better estimate of the underlying dictionary. Intuitively, this
happens for two reasons: (1) Reports generated from longer
n-grams contain information about a larger fraction of the total
string; we only collect two n-grams for communication efficiency,
so the n-gram size determines what fraction of a string
is captured by reports. (2) The larger the n-gram, the fewer
n-gram pairs exist in a string of fixed length. In simulation, we
observe that the likelihood of our algorithm missing an edge
between n-grams is roughly constant, regardless of n-gram
size. Therefore, if there are more n-gram pairs to consider with
smaller n-grams, the likelihood that at least one of the edges is
missing (thereby removing that string from consideration) is
significantly higher.

This hypothesis is supported by Figure 8, which shows the
false negative rate as a function of string size for different
n-gram sizes. In all of these trials, we did not observe any
false positives, so false negatives accounted for the entire
discrepancy in distributions. Because our distribution was
quite peaked (as is the case in many real-life distributions
over strings), missing even a few strings caused the overall
distribution distance to increase significantly.

Accuracy vs. computational costs: As mentioned previously,
graph-building becomes a bottleneck if the EM portion
of the algorithm is properly parallelized and optimized. This
stems from the potentially large number of candidate strings
that can emerge while searching over the k-partite graph of
n-grams. This number depends in part on the threshold $\delta$ used to
select “significant” associations between n-grams. Choosing
a larger threshold results in fewer graph edges and lower
computational load, but this comes at the expense of more
missed strings in the candidate set.

*We only generated one point using 4-grams due to the prohibitive memory
costs of decoding a dictionary with $26^4$ elements.

To understand this tradeoff better, we examined the impact
of the pairwise candidate threshold on accuracy. In principle, if
we were to set the threshold to zero, we could recover every
string in the dictionary. However, this greatly increases the
false positive rate, as well as the algorithmic complexity of
finding those strings. Figure 9 plots the Hellinger distance
of the recovered distribution against the number of edges in
the candidate n-gram graph for various distribution thresholds.
The number of edges in the n-gram k-partite graph indirectly
captures the computational complexity required to build and
prune candidates.

Fig. 7. Hellinger distance of our learned distribution from the true distribution
as a function of string length. The population size is 100,000 and each data
point is averaged over 100 trials (illustrated in lighter face color). Each user
sent information on two n-grams from each string.


Fig. 8. False negative rate as a function of string length. This was generated
from the same data summarized in Figure 7. We observe that discrepancies
between distributions are primarily caused by false negatives.

As expected, the computational complexity (i.e., number
of edges in the candidate graph) decreases as the threshold
increases. However, counterintuitively, the accuracy decreases
for very low thresholds. This occurs because each candidate
string is treated as an independent hypothesis, the null
hypothesis being that the candidate is not significant. When
testing m independent hypotheses at significance level $\alpha$
(here, m is the number of candidate strings),
it is common practice to use Bonferroni correction, which
reduces the significance level of each individual test to $\alpha/m$ in
order to account for the greater likelihood of seeing rare events
when there are multiple hypotheses. The net effect of this
is to impose more stringent significance tests when there are
more candidates. Since lowering the threshold also increases
the number of candidate strings, the resulting Bonferroni
correction causes many true strings to fail the significance test.
If we did not use Bonferroni correction, we would observe a
high number of false positives. Due to this effect, we observe
a clear optimal threshold value in Figure 9. The optimal
parameter setting is difficult to estimate without extensive
simulations that depend on the distributional information we’re
trying to estimate in the first place. However, in simulation
we observe that the threshold computed analytically in
Section IV-A, which is based on the statistics of the randomized
response noise, appears close to the optimum, and is likely a
good choice in practice.
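The effect of Bonferroni correction on the per-candidate test can be sketched numerically. The snippet assumes a one-sided z-test at overall level 0.05, which is an illustrative choice rather than a detail confirmed by the text; the candidate counts 185 and 896 are the ones reported for the URL experiment.

```python
from statistics import NormalDist


def bonferroni_z_cutoff(alpha: float, num_tests: int) -> float:
    """One-sided z-score a candidate must exceed after Bonferroni correction."""
    return NormalDist().inv_cdf(1.0 - alpha / num_tests)


for num_candidates in (1, 185, 896):
    cutoff = bonferroni_z_cutoff(0.05, num_candidates)
    print(num_candidates, round(cutoff, 2))
```

With a single test the cutoff is about 1.645 standard deviations; it rises as candidates are added, which is why a very low edge threshold, and hence a very large candidate set, can cause true strings to fail the final significance test.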

D. Estimating the Dictionary in Real-World Settings 

To understand how this approach might work in a real 
setting, we located a set of 100 URLs with an interesting 
real-world frequency distribution (somewhat similar to the 
Alexa dataset ||2l). We simulated measuring these strings 
through RAPPOR by drawing one million strings from the 
distribution, and encoding each string accordingly. We then 
decoded the reports using varying amounts of knowledge about 
the underlying dictionary of URLs. 

All URL strings were padded with white space up to 20
characters, matching the longest URL in the set. In addition
to full string reports, two randomly-chosen bigrams (out of
10) were also reported, all using 128-bit Bloom filters with
two hash functions. Overall privacy parameters were set to
q = 0.75, p = 0.25 and f = 0 (assuming one-time
collection). This choice of parameters provides $\epsilon = 4.39$ or
$\exp(\epsilon) = 81$ privacy guarantees, deliberately set quite high for
demonstration purposes only. Each of the collected reports
(based on the string itself and two bigrams) was allotted an equal
privacy budget of $\epsilon/3$, resulting in effective parameter choices
of p = 0.25 and q = 0.32.
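The overall privacy level quoted above can be checked with a short calculation. The sketch assumes the one-time-collection bound from the original RAPPOR analysis with f = 0, namely $\epsilon = h \ln\frac{q(1-p)}{p(1-q)}$ for h hash functions; this formula is not restated in this section, and the function name is illustrative.

```python
import math


def rappor_epsilon(p: float, q: float, num_hashes: int) -> float:
    """One-time-collection privacy level, assuming f = 0 (see lead-in)."""
    return num_hashes * math.log((q * (1 - p)) / (p * (1 - q)))


eps = rappor_epsilon(p=0.25, q=0.75, num_hashes=2)
print(round(eps, 2), round(math.exp(eps)))
# → 4.39 81
```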


Results are shown in Figure 10, where we truncate the
distribution to the top 30 URLs for readability. Each URL’s
true frequency is illustrated by the green bar. The other three
bars show frequency estimates for three different decoding
scenarios. A missing bar indicates that the string was not
discovered under that particular decoding scenario.

Under the first scenario, we performed an original RAPPOR
analysis with $\epsilon = 4.39$ and perfect knowledge of the 100
strings in the dictionary. With 1 million reports, we were
able to detect and estimate frequencies for 75 unique strings.
The second scenario also assumes perfect knowledge of all
100 strings, but performs collection at $\epsilon/3 = 1.46$. This
illustrates how much we lose purely by splitting up the privacy
budget to accommodate sending more information. In this
second scenario, 23 strings were detected, and their estimated
frequencies are shown with blue bars.

Fig. 9. The Hellinger distance between our recovered distribution and the true
distribution as a function of the number of edges in the k-partite candidate
graph. Each cluster was generated using a different edge-creation threshold
in the pairwise joint distributions. We observe that the optimal threshold value is
close to our chosen threshold, which can be deterministically computed from
the randomized response noise parameters.

In the third scenario, no prior knowledge of the dictionary
was used. Each string and bigram was collected at privacy level
$\epsilon/3 = 1.46$. Ten marginal bigram analyses, one for each bigram
position, returned 4, 2, 2, 3, 4, 5, 2, 1, 1, and 1 significant
bigrams, respectively. After conducting joint distribution
analysis on pairs of bigrams, we selected bigram pairs whose
joint frequency was above the threshold cutoff of $\delta = 0.0062$.
We then located the 10-cliques in the corresponding
10-partite graph, which produced 896 candidate strings. The final
marginal analysis based on the full string reports (to weed out
false positives) discovered the top five strings and estimated
their frequency quite accurately (pink bars). There was also
one false positive string identified by the analysis. We also
reran the collection with trigrams, which produced only 185
candidate strings. Final marginal analysis resulted in only two
strings with no false positives.

A note on accuracy: Unfortunately, our method does not
detect many of the strings in the population. While we make no
claims of optimality, either about the RAPPOR mechanism or
our estimation algorithm, there is a well-studied fundamental
tension between local differential privacy and data utility.
Compared to estimating a distribution from N unmasked
samples, estimating it with local differential privacy reduces
the effective sample size quadratically in $\epsilon$ when $\epsilon < 1$.
Since we collect each n-gram with privacy parameter
$\epsilon/3$, our effective learning rate is slowed down significantly
compared to regular RAPPOR. Moreover, estimation over
an unknown dictionary introduces an even greater challenge:
in the worst case, estimating a multinomial distribution at a given
fidelity requires a number of samples that scales linearly in
the support size of the distribution. So if we wish to estimate
a distribution over 6-letter words
without knowing the dictionary, in the worst case, we will
need on the order of 300 million samples ($26^6 \approx 3.1 \times 10^8$),
a number that grows quickly in string length. Considering these limitations,
it is to be expected that learning over an unknown dictionary
will perform significantly worse than learning over a known
dictionary, regardless of algorithm. Our algorithm nonetheless
consistently finds the most frequent strings, which account for
a significant portion of the distribution’s probability mass, both
in our example and in many distributions observed in practice.
This enables an aggregator to learn about dominant trends in a
population without any prior information and without violating
the privacy of users.

VI. Related Work 

Since its introduction nearly a decade ago, differential
privacy has become perhaps the best studied and most widely
accepted definition of privacy. When there is no party
trusted to construct a database of the sensitive data, the more
refined notion of local differential privacy is often considered.
Research on local differential privacy has
largely been centered around finding algorithms that satisfy
differential privacy properties, and improving the
tradeoffs between privacy and utility.

Our work follows in a recent trend of using local differential
privacy to learn a distribution’s heavy-hitters: the most significant
categories in a distribution. Several of these
papers focus on the information-theoretic limits of estimating
heavy-hitters while satisfying differential privacy. Our paper
differs from existing work by asking new questions aimed at
improving the practicality of the recently-introduced RAPPOR
mechanisms. Specifically, we consider two key questions:
how to decode joint distributions from noisy reports, and how
to learn distributions when the aggregator does not know the
dictionary of strings beforehand.


Our work combats the recent notion that differential privacy
is mainly of theoretical interest. To that end, we have identified
two of the main technical shortcomings of a differentially-private
mechanism that has seen practical, real-world deployment,
namely RAPPOR, and provided usable solutions that
address those shortcomings.

The question of estimating distributions from differentially
private data is not new, with Williams et al. first making
explicit the connection between probabilistic inference and
differential privacy. This previous work is similar in
principle to our approach. There has even been some work
on the distinct but related problem of releasing differentially
private marginal distributions generated from underlying
multivariate distributions.

However, existing work on distribution estimation from
differentially private data only considers continuous random
variables, while our work focuses on discrete random variables
(specifically, strings). This difference leads to significant
practical challenges not addressed by prior literature, and
addressing those challenges is crucial to improving the RAPPOR
mechanism.

Learning the distribution of random strings through
differentially private data with an unknown dictionary is, to the best
of our knowledge, a previously-unstudied question.

VII. Discussion 

Privacy-preserving crowdsourcing techniques have great
potential as a means of resolving the tensions between
individuals’ natural privacy concerns and the need to learn
overall statistics, for each individual’s benefit as well as for
the common good. The recently-introduced RAPPOR mechanism
provides early evidence that such techniques can be
implemented in practice, deployed in real-world systems, and
used to provide statistics with some benefits, at least in the
application domain of software security. In this paper, we have
addressed two significant limitations of this original RAPPOR
system: its inability to learn the associations between
RAPPOR-reported variables, and its need to know the data
dictionary of reported strings ahead of time. Notably, we have
been able to achieve those improvements without changing
the fundamental RAPPOR mechanisms or weakening its local
differential privacy guarantees.

This said, our new analysis techniques are not without their
own shortcomings. From a practical deployment perspective,
the main limitation of our methods is the cost of decoding,
where the primary bottleneck is the cost of joint estimation.
Parts of the joint distribution estimation algorithm are
parallelizable, but each iteration of the EM algorithm ultimately
depends on the previous iteration, as well as the entire dataset. If
the number of users is very large, this can lead to significant
memory and computational loads. The original RAPPOR
system gets around this by using a LASSO-based decoding
scheme that removes the dependency on every individual
report while returning unbiased estimates of the marginal
distribution. We cannot use this same approach because we
do not have access to the joint count frequencies. Nonetheless,
a similar lightweight decoding algorithm for multivariate
distributions would significantly improve the practicality of
our enhanced RAPPOR analysis.

Fig. 10. Learning a distribution of URLs with and without the dictionary. The dictionary contains 100 strings (only the top 30 are shown). The bottom bar
indicates the URL’s true frequency. The second bar from the bottom shows estimated frequencies when collected at $\epsilon = 4.39$ with full knowledge of the
dictionary. The third bar also assumes the full knowledge of the dictionary but the collection took place with $\epsilon/3 = 1.46$ (stronger privacy); this illustrates
losses incurred by allocating 2/3 of the privacy budget to collecting two bigrams. The top bar shows distribution estimates computed without any knowledge
of the dictionary; each report and n-gram was again encoded at $\epsilon/3$ privacy.

Another shortcoming in our current work is the lack of
optimal methods for parameter selection, and the allocation
of privacy budgets. In our experiments, we have allocated 1/3
of the privacy budget to each of the full-string report and the two
n-gram reports, but this allocation does not necessarily maximize
estimation accuracy. For instance, it may provide more utility,
at similar overall levels of privacy, to collect n-grams with
more relaxed privacy guarantees to get a better estimate
of the candidate set, and then use stricter privacy settings
when collecting full string reports. Because our algorithm’s
performance is distribution-dependent, it is difficult to estimate
optimal settings theoretically. Moreover, searching over the
complete parameter space is computationally challenging, due
to the cost of decoding. We hope that the eventual public
release of our analysis mechanisms (deferred for blind review)
will encourage experimentation on both fronts. Furthermore,
we also aim to tackle these challenges ourselves, in future
work.

VIII. Conclusions 

Privacy-preserving crowdsourcing techniques based on
randomized response can provide useful, new insights into
unknown distributions of sensitive data, even while providing
the strong guarantees of local $\epsilon$-differential privacy. As shown
in this paper, such privacy-preserving statistical learning is
possible even when there are multiple degrees and levels of
unknowns. In particular, by augmenting the analysis methods
of the existing RAPPOR mechanism it is possible to learn
the joint distribution and associations between two or more
unknown variables, and learn the data dictionaries of frequent,
unknown values from even very large domains, such as strings.
Furthermore, those augmented RAPPOR analysis techniques
can be practical, and can be of value when applied to
real-world data.

Appendix A

Testing for Association: Validation

We wish to demonstrate the validity of our proposed statistic
for testing the independence of two or more variables collected
with RAPPOR.



Fig. 11. Quartiles. 

Consider two independent random variables, X and Y. If
we test a null hypothesis on these variables at a significance
level of α = 0.05, we would expect to falsely reject the null
hypothesis 5 percent of the time. More precisely, if the test
statistic is continuous (as ours is), then the p-value is uniformly
distributed between 0 and 1 if the null hypothesis is true.

Thus, in order to demonstrate that our proposed statistic
can be used as a test of independence, we generated a
pair of distributions of independent random variables, X
and Y. In each trial, we drew K = 10,000 data points
$\{(x_1, y_1), \ldots, (x_K, y_K)\}$. After encoding these data points
with RAPPOR and then jointly decoding the reports, we obtain
estimates $\hat{p}_{ij}$ and $\hat{\Sigma}$. We use these estimates to compute our
proposed statistic, T, for a single trial. We did this for 100
trials.

Since the null hypothesis is true by construction, the
p-values of our test should have a distribution that is uniform.
Therefore, the expected quantiles of our constructed dataset
should be uniformly spaced between 0 and 1. Figure 11 plots
these expected quantiles against our observed quantiles from
the dataset. Because the points are well-represented by a
linear fit with slope 1 and intercept 0, we conclude that our
test statistic has the desired properties as a test of variable
independence.
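
The validation logic can be sketched in a few lines. The sketch below is only an illustration of the quantile comparison, and it rests on a substitution: it does not encode data with RAPPOR or compute our statistic T. Instead it applies an ordinary Pearson chi-square test to pairs of independent binary variables. The function names and the marginal probabilities 0.3 and 0.6 are assumptions made for this example.

```python
import math
import random


def chi2_pvalue_2x2(xs, ys):
    """Pearson chi-square test of independence for two binary samples."""
    n = len(xs)
    counts = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (counts[i][j] - expected) ** 2 / expected
    # Chi-square survival function with 1 degree of freedom:
    # P(X > t) = erfc(sqrt(t / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def null_pvalues(trials=100, n=10_000, seed=0):
    """Sorted p-values from trials in which X and Y are independent."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(trials):
        xs = [int(rng.random() < 0.3) for _ in range(n)]
        ys = [int(rng.random() < 0.6) for _ in range(n)]
        pvals.append(chi2_pvalue_2x2(xs, ys))
    return sorted(pvals)


def qq_points(sorted_pvals):
    """Pair each observed p-value with its expected uniform quantile."""
    m = len(sorted_pvals)
    return [((k + 0.5) / m, p) for k, p in enumerate(sorted_pvals)]


def max_deviation(sorted_pvals):
    """Largest gap between expected and observed quantiles."""
    return max(abs(e - o) for e, o in qq_points(sorted_pvals))
```

Plotting the pairs returned by `qq_points(null_pvalues())` gives a quantile plot of the same kind as Figure 11. For a well-calibrated test the points lie near the line with slope 1 and intercept 0, so `max_deviation` should be small.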

Appendix B 

Learning Unknown Dictionaries:
Different Distributions


Throughout this paper, our simulations were run over discrete
approximations of Zipfian distributions. This class of
distributions is common in practice, but for completeness, we
also include results from other distributions that may arise in
practice. All plots in this section were generated by running
the decoding algorithm over N = 100,000 simulated reports.
Each report was generated from a string drawn from a specified
distribution over 100 categories. The privacy parameters
used were p = 0.25, q = 0.75, and f = 0, and we used
128-bit Bloom filters. Figure 12 shows the estimated distribution
when the underlying distribution is a truncated geometric with
parameter p = 0.3 (distinct from the privacy parameter p above).
The geometric distribution is the discrete
equivalent of an exponential distribution. In terms of estimate
accuracy, there is little difference between this distribution and
the Zipfian distribution we used in simulation.
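
For readers who wish to reproduce the input distribution, the following is a minimal sketch of a truncated geometric distribution over 100 categories with parameter 0.3. It assumes that truncation is handled by renormalizing the first 100 geometric probabilities, which the text above does not specify. The function names are hypothetical, and the sketch draws only the underlying category indices; it does not perform RAPPOR encoding.

```python
import random


def truncated_geometric_pmf(p=0.3, categories=100):
    """Geometric probabilities for the first `categories` values, renormalized.

    Renormalization is an assumption; other truncation conventions exist.
    """
    weights = [p * (1.0 - p) ** k for k in range(categories)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_categories(pmf, n, seed=0):
    """Draw n category indices (0-based) according to pmf."""
    rng = random.Random(seed)
    return rng.choices(range(len(pmf)), weights=pmf, k=n)


pmf = truncated_geometric_pmf()
# One underlying category index per simulated report.
draws = sample_categories(pmf, 100_000)
```

With 100 categories the truncated tail is negligible, so the most frequent category has probability very close to 0.3 and each later category is 0.7 times as likely as the one before it.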

Figure 13 shows the estimated distribution when the underlying
distribution is a synthetic stepwise function. This
function has four strings with probability mass 0.12, four
strings with probability mass 0.06, and the remaining 92
strings share the remaining mass evenly. In this distribution
we can see that the estimates are less accurate than in the
geometric distribution. This occurs because in the step
distribution, a significant fraction of probability mass lies below the
noise threshold and gets accumulated in the “Other” category.
While our approach to handling the “Other” category leads
to unbiased estimates, it nonetheless relies on approximations
from noisy data. Therefore, the more significant the “Other”
category, the less certain we are about the bigram joint
frequencies, leading to worse estimates overall. The takeaway
message is that the flatter the distribution (i.e., the greater the
entropy), the lower the likelihood of our algorithm accurately
capturing the distribution. Indeed, if we run this algorithm on a
uniform distribution with each string having probability 0.01,
none of the strings are found.

Fig. 12. Estimated distribution of hash strings without knowing the dictionary
a priori. The underlying distribution is geometric with p = 0.3.

Fig. 13. Estimated distribution of hash strings without knowing the dictionary
a priori. The underlying distribution is a synthetic step function we
constructed.
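
The relationship between flatness and entropy noted in this appendix can be checked directly. The sketch below builds the step distribution described above, a uniform distribution over 100 strings, and a truncated geometric distribution with parameter 0.3, then computes the Shannon entropy of each. The function names are hypothetical, and renormalizing the truncated geometric is an assumption.

```python
import math


def step_pmf():
    """Four strings at 0.12, four at 0.06, and 92 sharing the rest evenly."""
    head = [0.12] * 4 + [0.06] * 4
    tail_mass = 1.0 - sum(head)  # 0.28 left for the 92 remaining strings
    return head + [tail_mass / 92] * 92  # about 0.003 per tail string


def geometric_pmf(p=0.3, categories=100):
    """Truncated geometric, renormalized (an assumed truncation convention)."""
    weights = [p * (1.0 - p) ** k for k in range(categories)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy_bits(pmf):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(q * math.log2(q) for q in pmf if q > 0)


uniform = [0.01] * 100
entropies = {
    "geometric": entropy_bits(geometric_pmf()),
    "step": entropy_bits(step_pmf()),
    "uniform": entropy_bits(uniform),
}
```

The entropies increase from the geometric to the step to the uniform distribution, which matches the order in which estimation accuracy degrades in the results above.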










